Zurich and Vaud by the Numbers - Predictive Insights into Tourism Dynamic
Authors
Affiliation
Valeriia Bilousko, Urs Hurni, Jayesh Smith
University of Lausanne
Anastasia Pushkarev, Emeline Raimon
Published
May 21, 2024
Abstract
The following Forecasting project focuses on forecasting the overnight stays of visitors in the Swiss canton of Vaud and from the Philippines to Zurich from October 2023 to December 2024. Utilizing comprehensive tourism data, the project reveals insights into the trends and seasonal patterns in both regions, emphasizing the varied impacts of the COVID-19 pandemic on international tourism dynamics. The forecasting phase employs Error, Trend, Seasonality (ETS) and AutoRegressive Integrated Moving Average (ARIMA) models, which are chosen based on their ability to accurately capture underlying data patterns, including seasonality and trend shifts due to global events. The project’s findings provide actionable insights for tourism management and strategic planning in the regions studied, highlighting the need for adaptive strategies in response to global disruptions such as pandemics.
1 Exploration & Visualization
1.1 Objectives
The main objectives of this project is to predict :
The overnight stays of the visitors in Vaud, from October 2023 until December 2024.
The overnight stays of visitors from Philippines to Zürich, from October 2023 until December 2024.
2 Cleaning & Wrangling
Click to show code
# Load the data in folder data named Dataset_tourism.xlsx)tourism_data <- readxl::read_xlsx(here("data/Dataset_tourism.xlsx"))#removing value 'Herkunftsland - Total' in column 'Herkunftsland' as it is just the total#tourism_data <- tourism_data %>% filter(Herkunftsland != "Herkunftsland - Total")#print unique values in month columnunique(tourism_data$Monat)#> [1] "Januar" "Februar" "März" "April" "Mai" #> [6] "Juni" "Juli" "August" "September" "Oktober" #> [11] "November" "Dezember"# change ' [1] "Januar" "Februar" "März" "April" "Mai" "Juni" "Juli" "August" "September" "Oktober" "November" "Dezember" into english month'tourism_data$Monat <- tourism_data$Monat %>%recode_factor("Januar"="January","Februar"="February","März"="March","April"="April","Mai"="May","Juni"="June","Juli"="July","August"="August","September"="September","Oktober"="October","November"="November","Dezember"="December")#add date type column for plotting purposestourism_data <- tourism_data %>%mutate(Date =dmy(paste("01", Monat, Jahr)))# filtering out 'Herkunftsland - Total' in column 'Herkunftsland' as it is just the totaltourism_data_no_total <- tourism_data %>%filter(Herkunftsland !="Herkunftsland - Total")#check for NANsum(is.na(tourism_data_no_total))#> [1] 51395#analyse the NAN values, where are they(tourism_data_no_total %>%filter(is.na(value)))#> # A tibble: 51,395 x 6#> Herkunftsland Kanton Monat Jahr value Date #> <chr> <chr> <fct> <chr> <dbl> <date> #> 1 Malta Schwe~ Janu~ 2005 NA 2005-01-01#> 2 Zypern Schwe~ Janu~ 2005 NA 2005-01-01#> 3 Mexiko Schwe~ Janu~ 2005 NA 2005-01-01#> 4 Übriges Zentralamerika, Karib~ Schwe~ Janu~ 2005 NA 2005-01-01#> 5 Bahrain Schwe~ Janu~ 2005 NA 2005-01-01#> 6 Katar Schwe~ Janu~ 2005 NA 2005-01-01#> 7 Kuwait Schwe~ Janu~ 2005 NA 2005-01-01#> 8 Australien Schwe~ Janu~ 2005 NA 2005-01-01#> 9 Neuseeland, Ozeanien Schwe~ Janu~ 2005 NA 2005-01-01#> 10 Oman Schwe~ Janu~ 2005 NA 2005-01-01#> # i 51,385 more rows
2.0.1 Tourism Data - Vaud
Given the two objectives of the project, we are going to filter the initial dataset in order to keep and analyse only the cantons of interest. We start by filtering the “Kanton” column to keep only the canton of Vaud. Subsequently, we filter the “Herkunftsland” column by “Herkunftsland - Total.” This step ensures that our analysis encompasses data representing the aggregate number of tourists from all origins, providing a comprehensive overview of tourism activity in the canton of Vaud. There are no missing values present in the filtered dataset.
Click to show code
# Filter by canton Vaud tourism_vaud <- tourism_data %>%filter(Kanton =="Vaud")#check for NANsum(is.na(tourism_vaud))#> [1] 1869
2.0.2 Tourism Data - Zurich and Philippines
The data Zurich contained the total number of overnight stays in Zurich by tourists from different countries. We are interested in the number of overnight stays by tourists from the Philippines. See appendix for more details on the table.
Thus, we are filtering the “Kanton” column and the ‘Herkunftsland’ column, keeping “Zurich” and “Philippinen” for the country of origin.
Click to show code
#filter column 'Kanton' for Zurichtourism_data_zurich <- tourism_data_no_total %>%filter(Kanton =="Zürich")#check for NANsum(is.na(tourism_data_zurich))#> [1] 1869#analyse the NAN values, where are theytourism_data_zurich %>%filter(is.na(value))#> # A tibble: 1,869 x 6#> Herkunftsland Kanton Monat Jahr value Date #> <chr> <chr> <fct> <chr> <dbl> <date> #> 1 Malta Zürich Janu~ 2005 NA 2005-01-01#> 2 Zypern Zürich Janu~ 2005 NA 2005-01-01#> 3 Mexiko Zürich Janu~ 2005 NA 2005-01-01#> 4 Übriges Zentralamerika, Karib~ Zürich Janu~ 2005 NA 2005-01-01#> 5 Bahrain Zürich Janu~ 2005 NA 2005-01-01#> 6 Katar Zürich Janu~ 2005 NA 2005-01-01#> 7 Kuwait Zürich Janu~ 2005 NA 2005-01-01#> 8 Australien Zürich Janu~ 2005 NA 2005-01-01#> 9 Neuseeland, Ozeanien Zürich Janu~ 2005 NA 2005-01-01#> 10 Oman Zürich Janu~ 2005 NA 2005-01-01#> # i 1,859 more rowstourism_data_zurich_philippines <- tourism_data_zurich %>%filter(Herkunftsland =="Philippinen")#show table using reactabletable_zh_ph <-reactable(tourism_data_zurich_philippines, searchable =TRUE,defaultPageSize =5,sortable =TRUE,filterable =TRUE,pagination =TRUE,bordered =TRUE,highlight =TRUE,striped =TRUE,compact =TRUE,showPageSizeOptions =TRUE,showSortable =TRUE )
Filtering for Philippinen solved the problem of missing data we had with all countries of origin. The overnight stays are all included throughout the period.
3 EDA - Vaud
3.1 Visitors in Vaud
Graphical view of total number of tourists in canton Vaud :
Click to show code
tourism_vaud_total <- tourism_vaud %>%filter(Herkunftsland =='Herkunftsland - Total') %>%select(-c(Herkunftsland, Kanton, Monat, Jahr))# Create the ggplot object with viridis color paletteplot_vaud_total <- tourism_vaud_total %>%ggplot(aes(x = Date, y = value)) +geom_line(color =viridis(1)) +# Use viridis color palette for a single linelabs(x ="Date", y ="Number of tourists", title ="Total number of tourists in canton Vaud") +theme_minimal()# Convert to an interactive plotly object with specified width and heightinteractive_plot_total <-ggplotly(plot_vaud_total, width =600, height =300)# Adjust plotly settingsinteractive_plot_total %>%layout(margin =list(l =60, r =60, b =60, t =80),showlegend =FALSE# Remove legend if any )
The plot shows key trends in tourist visits to the canton of Vaud. There are noticeable dips in overnight stays at the year’s start, typically due to holidays, with a significant drop in 2020 due to the COVID pandemic. Mainly, tourists in canton Vaud are from Switzerland, followed by France (see appendix 8.2). Swiss tourists show a rising trend excluding the pandemic period. Overall, the data displays a seasonal pattern with fewer tourists at year-end and a peak during summer. Future forecasts should consider these seasonal variations and changing trends.
3.1.1 Decomposition
We have process an additive decomposition of the time series into three components: trend, seasonality and residual. These components will allow us to understand how they contribute to the variations observed in Swiss tourism data. see the appendix for more details on the decomposition
Click to show code
# Convert data to a time series objectvaud_ts <- tourism_vaud_total %>%arrange(Date) %>%# Filtre pour enlever les valeurs NA dans 'Date'filter(!is.na(Date)) %>%# Ensure data is complete and monthlycomplete(Date =seq.Date(min(Date, na.rm =TRUE), max(Date, na.rm =TRUE), by ="month")) %>%replace_na(list(value =0)) %>%# Replace NA values if there are any# Create a time series objectwith(ts(value, frequency =12, start =decimal_date(min(Date, na.rm =TRUE))))# Perform STL decompositionstl_vd <-stl(vaud_ts, s.window ="periodic") %>%autoplot() +labs(x ="Date", y ="Number of tourists", title ="STL Decomposition for number of tourists in canton Vaud") +theme_minimal()
The decomposition analysis unveils significant trends in the data: a clear upward trajectory until 2020 followed by a sharp decline due to pandemic-induced travel restrictions, alongside evident monthly seasonality marked by regular fluctuations. Notably, the residual component remains stable until 2020, after which a slight uptick in volatility suggests potential external influences not captured by trend and seasonality.
4 EDA - Zurich
As for Vaud, the most frequent visitors to Zurich are Swiss. Germany and United States are the two main foreign countries to visit Zurich. This can be explained by the fact that the canton of Zurich is closer to Germany and therefore easier to reach. The same applies to France with the canton of Vaud. The yellow curve represents the Philippines. The curve is flat and shows a considerably small number of trips from this country over the period. There is a drastic fall in 2020 caused by COVID-19. The pandemic has had a significant impact on the tourism industry worldwide. At first glance, there are regular seasonal peaks for most countries which also suggest the presence of seasonality in tourism in the canton of Zurich. check the Appendix for more details with a Overall Zurichs Visitors graph
4.1 Zurich and Philippines Visitors
This graph shows only visitors from the Philippines, as this is the country of interest in our analysis.
Click to show code
##### EDA ZH ALL ###### Preparing the data#removing value in column 'Herkunftsland' as it is just the whole of Switzerlanddata <- tourism_data_zurich %>%filter(!is.na(value)) %>%# Removing rows with NA values in the 'value' columnmutate(Monat =month(Date, label =TRUE, abbr =TRUE), # Extract month Jahr =year(Date)) %>%# Extract year from Dategroup_by(Herkunftsland, Date) %>%# Group by country and datesummarise(Trips =sum(value), .groups ='drop') # Summing up trips for each country per datep <-ggplot(data, aes(x = Date, y = Trips, group = Herkunftsland,color = Herkunftsland =="Philippinen",text =paste("Country:", Herkunftsland, "<br>Trips:", Trips))) +# Added text for tooltipgeom_line(show.legend =FALSE) +scale_color_viridis_d() +# Use viridis color palettelabs(title ="Number of Trips from Each Country to Zurich",x ="Date",y ="Number of Trips") +theme_minimal()# Convert to an interactive plotly objectinteractive_plot <-ggplotly(p, tooltip ="text", width =600, height =600)# Adjust plotly settings interactive_plot_zh_all <- interactive_plot %>%layout(margin =list(l =60, r =60, b =60, t =80), # Adjust marginsshowlegend =FALSE# Show legend )##### EDA ZH PHILIPPINES ###### Use tourism_data_zurich_philippines data to plot the values in y axis and Date in x axisp <-ggplot(tourism_data_zurich_philippines, aes(x = Date, y = value)) +geom_line(color =viridis(1)) +# Use viridis color palette for a single linelabs(title ="Number of Trips from Philippines to Zurich",x ="Date",y ="Number of Trips") +theme_minimal()# Convert to an interactive plotly object with specified width and heightinteractive_plot <-ggplotly(p, width =600, height =300)# Adjust plotly settingsinteractive_plot <- interactive_plot %>%layout(showlegend =FALSE# Remove legend if any )# Display the interactive plotinteractive_plot
4.1.1 Pattern
We choose a multiplicative model for the STL decomposition of the number of trips from the Philippines to Zurich because the data (previous plot) shows increasing variance and larger fluctuations as the number of trips grows, suggesting that seasonal and trend components scale proportionally with the overall level of the series. check the appendix for the decomposition and further seasonal graph
4.1.1.1 Decomposition
Click to show code
# Convert data to a time series objecttourism_ts <- tourism_data_zurich_philippines %>%arrange(Date) %>%# Ensure data is complete and monthlycomplete(Date =seq.Date(min(Date), max(Date), by ="month")) %>%replace_na(list(value =0)) %>%# Replace NA values if there are any# Create a time series objectwith(ts(log(value), frequency =12, start =decimal_date(min(Date))))# Perform STL decomposition on the transformed datatourism_stl <-stl(tourism_ts, s.window ="periodic")# Plotting the results (transforming back by exponentiating)stl_zh_m <-autoplot(tourism_stl) +labs(x ="Date", y ="Number of tourists", title ="Multiplicative STL Decomposition for number of tourists from Philippines to Zurich") +theme_minimal()# Convert data to a time series objecttourism_ts <- tourism_data_zurich_philippines %>%arrange(Date) %>%# Ensure data is complete and monthlycomplete(Date =seq.Date(min(Date), max(Date), by ="month")) %>%replace_na(list(value =0)) %>%# Replace NA values if there are any# Create a time series objectwith(ts(value, frequency =12, start =decimal_date(min(Date))))# Perform STL decompositionstl_zh <-stl(tourism_ts, s.window ="periodic") %>%autoplot() +labs(x ="Date", y ="Number of tourists", title ="STL Decomposition for number of tourists from Philippines to Zurich") +theme_minimal()# several chart per month to see the seasonalityseaso_plot_zh_2 <-ggsubseriesplot(tourism_ts) +ylab("Number of tourists") +xlab("Month") +ggtitle("Seasonal subseries plot")#debug#better to use gg_subseries to see the seasonality#tourism_ts %>% gg_subseries(value) + ylab("Number of tourists") + xlab("Month") + ggtitle("Seasonal subseries plot")
We noted :
An upward trend until around 2020, when it falls sharply because of the pandemic and travel restrictions. The pandemic had a longer effect on the Philippine tourism, which stopped for a longer period (around 2 years or more).
Multiple peaks in the seasonal monthly component. These fluctuations are due to their calendar. Philippines start their summer holidays earlier than we do (31 of May - 29 of July) and have longer La Toussaint holidays (5 October - 18 October - 28 October).
A residual component with moderate variability which increases from 2020 onwards, indicating the influence of unforeseen or exceptional events (such as the pandemic) that have disrupted the usual models.
5 Modelling
In this section, we will apply forecasting techniques to model tourism data for Zurich, aiming to predict future trends.
5.1 Total number of visitors in Vaud
We will forecast the total number of visitors in Vaud over a 15-month horizon using ETS, ARIMA, and TSLM models.
5.1.1 ETS model
The ETS (Error, Trend, Seasonality) model is used due to its simplicity and reliability. We employ the ETS “AAA” model, which considers additive error, trend, and seasonality components. The forecast follows trends and seasonality but shows high uncertainty.
Click to show code
ets_vaud <-ets(vaud_ts, model ="AAA")forecast_ets_vaud <-forecast(ets_vaud, h =15) # Calculate the accuracy of the training setmetrics_ets_vaud <- ets_vaud %>%accuracy() %>%kable("html", caption ="ETS AAA Metrics") %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed", "responsive"))# Extract and print the AICaic_ets_vaud <-AIC(ets_vaud)
5.1.2 ARIMA model
The automatic ARIMA (AutoRegressive Integrated Moving Average) model selects optimal parameters to capture data patterns. Initially, ARIMA(5,0,0)(0,1,1) was chosen automatically, reflecting the seasonal patterns and a slight downward trend due to the 2020 drop, however, currently, there are no evident reasons for a decrease in the number of visitors. Adjusting to ARIMA(5,1,0)(0,1,1) improves trend capture and aligns forecasts with observed patterns. The model`s explicitly was changed using a differencing order of 1 in the non-seasonal component.
Click to show code
# Convert your time series data to a tsibble objectvaud_tsibble <-as_tsibble(vaud_ts)# Fit an ARIMA modelfit_arima_vaud <- vaud_tsibble %>%model(ARIMA_auto =ARIMA(value),ARIMA_dif =ARIMA(value ~pdq(5, 1, 0) +PDQ(0, 1, 1)) )# Generate forecasts for the next 15 monthsforecast_arima_vaud <- fit_arima_vaud %>%forecast(h =15)# Plot the forecasts along with the historical dataplot_arima_vaud <- forecast_arima_vaud %>%autoplot(vaud_tsibble, alpha =0.3) +labs(title ="ARIMA Forecast of Tourists in Vaud",x ="Date",y ="Number of Tourists") +guides(colour =guide_legend(title ="Forecast"))# Calculate the accuracy of the training setmetrics_arima_vaud <- fit_arima_vaud %>%accuracy() %>%kable("html", caption ="ARIMA Metrics") %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed", "responsive"))# Extract and print the AIC for each modelaic_arima_vaud <- fit_arima_vaud %>%glance() %>%select(.model, AIC)
5.1.3 TSLM
In light of the significant drop in tourist numbers caused by the COVID-19 pandemic in 2019, it is crucial to address this anomaly effectively in our forecasting methodologies. The COVID-19 pandemic significantly impacted tourist numbers. We incorporate the Stringency Index from the OxCGRT dataset (taken from Our World In Data) into our Time Series Linear Regression (TSLM) model. Assuming no restrictions for the next 15 months, we set the stringency index to 0, aiming for an optimistic forecast reflecting pre-pandemic travel levels.
Click to show code
stringency_df <-read.csv(here("data/stringency_index.csv"))stringency_switzerland <-filter(stringency_df, location =="Switzerland")stringency_switzerland$date <-as.Date(format(dmy(stringency_switzerland$date), "%Y-%m-01"))# Aggregate to monthly average, ensuring date format is maintainedstringency_switzerland <- stringency_switzerland %>%group_by(date) %>%summarize(avg_stringency_index =mean(stringency_index, na.rm =TRUE))# Rename date column tourism_vaud_total <- tourism_vaud_total %>%rename(date = Date)# Then merge with Switzerland datamerged_vaud <-merge(tourism_vaud_total, stringency_switzerland, by ="date", all.x =TRUE)# Replace NA values in avg_stringency_index with 0 if necessarymerged_vaud <- merged_vaud %>%mutate(avg_stringency_index =replace_na(avg_stringency_index, 0))# Convert merged_vaud data frame to tsibble objectmerged_vaud_tsibble <- merged_vaud %>%as_tsibble(index = date)# Define the modelModel_vaud_stringency <- merged_vaud_tsibble %>%model(TSLM_vaud_str =TSLM(value ~trend() +season() + avg_stringency_index) )# Generate future datesfuture_dates <-seq.Date(from =as.Date("2023-10-01"), by ="month", length.out =15)# Create future data with avg_stringency_index set to 0 for 15 monthsfuture_data <-tibble(date = future_dates,avg_stringency_index =0) %>%as_tsibble(index = date)# Forecastforecast_vaud_stringency <- Model_vaud_stringency %>%forecast(new_data = future_data)# Calculate the accuracy of the training setmetrics_tslm_vaud <- Model_vaud_stringency %>%accuracy() %>%kable("html", caption ="TSLM Metrics") %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed", "responsive"))# Extract and print the AIC for the TSLM modelaic_tslm_vaud <- Model_vaud_stringency %>%glance() %>%filter(.model =="TSLM_vaud_str") %>%pull(AIC)
5.2 Zurich and Philippines
In this section, we will focus on the number of visitors from the Philippines to Zurich. We will use the Naive as a benchmark for ETS and ARIMA models to forecast the number of visitors from the Philippines to Zurich over a 15-month horizon.
5.2.1 Forecast without dealing with Covid
As we have seen, Covid has had a significant impact on the number of tourists in Zurich. We will first forecast the number of tourists without taking into account the Covid period. This will allow us to see the impact of the pandemic on the forecasts.
5.2.1.1 ETS
We first run a NAIVE model to have a benchmark for the other model. And a AAA model for the ETS model for ensuring that the model is indeed worse than a multiplicative one, based on our previous observations. see appendix for the plots and metrics of those 2 models
Click to show code
###### NAIVE part ######convert tourism_ts to tsibbletourism_ts <- tourism_ts %>%as_tsibble()# Fit a naive modelfit <- tourism_ts %>%model(NAIVE =NAIVE(value))# Forecast the next 2 years periodsforecast <- fit %>%forecast(h =15)# Plot the forecasts along with the historical data, and make the colors of the forecast a bit more transparent for distinguishably purposesplot_naive <- forecast %>%autoplot(tourism_ts, alpha =0.5) +labs(title ="Forecast of tourists from Philippines to Zurich",x ="Date",y ="Number of tourists") +guides(colour =guide_legend(title ="Forecast"))metrics_naive <- fit %>%accuracy()# Display accuracy metrics in an HTML tablemetrics_naive <- metrics_naive %>%kable("html", caption ="Naive Model Metrics") %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed", "responsive"))###### ETS PART AAA ###### Fit an ETS model# Adjusting the model parameters according to the characteristics of the data# Here "A" means additive error, "N" means no trend, and "N" means no seasonality# change these if neededfit <- tourism_ts %>%model(ETS_M_seaso =ETS(value ~error("A") +trend("A") +season("A")))# Forecast the next 2 years periodsforecast <- fit %>%forecast(h =15)# Plot the forecasts along with the historical data, and make the colors of the forecast a bit more transparent for distinguishably purposesplot_ets_AAA <- forecast %>%autoplot(tourism_ts, alpha =0.5) +labs(title ="Forecast of tourists from Philippines to Zurich",x ="Date",y ="Number of tourists") +guides(colour =guide_legend(title ="Forecast"))# Calculate the accuracy of the training setmetrics_ets_AAA <- fit %>%accuracy() %>%kable("html", caption ="ETS AAA Metrics") %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed", "responsive"))##### ETS ####fit <- tourism_ts %>%model(ETS_M_seaso_Ad =ETS(value ~error("A") +trend("Ad") +season("M")))# Forecast the next 2 years periodsforecast <- fit %>%forecast(h =15, level =95)# Extract the forecasts for the ETS_M_seaso_Ad modelforecast_ETS_M_seaso_Ad <- forecast %>%filter(.model =="ETS_M_seaso_Ad") %>%mutate(# Extract mean forecastsETS_M_seaso_Ad_Forecast = .mean,# Extract 95% confidence intervals`95%`=hilo(value, level =95) ) %>%# Unpack the 95% confidence intervalsunpack_hilo(`95%`, names_sep ="_")# Now, rename columns as neededforecast_ETS_M_seaso_Ad <- forecast_ETS_M_seaso_Ad %>%rename(Date = index,ETS_Lo_95 =`95%_lower`,ETS_Hi_95 =`95%_upper` )# Select the relevant columnsforecast_ETS_M_seaso_Ad <- forecast_ETS_M_seaso_Ad %>%select(Date, ETS_M_seaso_Ad_Forecast, ETS_Lo_95, ETS_Hi_95)# comparing several modelfit <- tourism_ts %>%model(ETS_M_seaso =ETS(value ~error("A") +trend("A") +season("M")), #multiplicative seasonalityETS_M_seaso_Ad =ETS(value ~error("A") +trend("Ad") +season("M")), #dampted trend )# Forecast the next 2 years periodsforecast <- fit %>%forecast(h =15)# Plot the forecasts along with the historical data, and make the colors of the forecast a bit more transparent for distinguishably purposesplot_ets_AAM_AAdM <- forecast %>%autoplot(tourism_ts, level =90, color ="blue", alpha =0.5) +labs(title ="Forecast of tourists from Philippines to Zurich",x ="Date",y ="Number of tourists") +guides(colour =guide_legend(title ="Forecast"))# Extract AIC valuesaic_values <- fit %>%glance() %>%select(.model, AIC)# Print the AIC valuesprint(aic_values)#> # A tibble: 2 x 2#> .model AIC#> <chr> <dbl>#> 1 ETS_M_seaso 3256.#> 2 ETS_M_seaso_Ad 3195.# Display AIC values with forecast metricsmetrics <- fit %>%accuracy()metrics_with_aic <- metrics %>%left_join(aic_values, by =".model")# Display metrics with AIC in an HTML tablemetrics_ets <- metrics_with_aic %>%kable("html", caption ="Model Accuracy Metrics with AIC Values") %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed", "responsive"))
Now we will forecast the number of tourists from the Philippines to Zurich using the ETS AAM AND AAdM models. see the appendix for a vizualisation and metrics of the 2 models
We observe that the damped model is better on almost all metrics.
5.2.1.2 ARIMA
We also did 3 ARIMA models for comparison purposes but as they were not as good as the ETS models, we will not go into details here. see the appendix for the vizualisation of the 3 models
Click to show code
# Fit an automatic ARIMA modelfit_arima <- tourism_ts %>%model(ARIMA_auto =ARIMA(value),ARIMA_auto_seasonal =ARIMA(value~season()),ARIMA_auto_stepwise =ARIMA(value, stepwise =FALSE, approximation =FALSE))# Forecast the next 15 monthsforecast_arima <- fit_arima %>%forecast(h =15)# Plot the forecasts along with the historical dataplot_arima_zh <- forecast_arima %>%autoplot(tourism_ts, alpha =0.3) +labs(title ="ARIMA Forecast of Tourists from the Philippines to Zurich",x ="Date",y ="Number of Tourists") +guides(colour =guide_legend(title ="Forecast"))
You can also check the appendix for the metrics of the ARIMA models
Click to show code
# Extract AIC valuesaic_values <- fit_arima %>%glance() %>%select(.model, AIC)# Calculate the accuracy of the training setmetrics_arima <- fit_arima %>%accuracy()# Combine accuracy metrics with AIC valuesmetrics_arima <- metrics_arima %>%left_join(aic_values, by =".model")# Display accuracy metrics with AIC values in an HTML tablemetrics_arima_table <- metrics_arima %>%kable("html", caption ="Model Accuracy Metrics with AIC Values") %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed", "responsive"))
5.2.2 Forecasting Amidst the Covid-19 Pandemic
Now we are going to take into account the Covid period and see how this new information will impact our prediction models. Indeed, as we have seen for vaud the pandemic had a significant impact on the number of tourists.
5.2.2.1 Data Integration
See the appendix for a vizualisation of the stringency index for the Philippines and Switzerland and more details on how it was fetched
Click to show code
# Convert data to a time series objecttourism_ts <- tourism_data_zurich_philippines#rename Date as datenames(tourism_ts)[names(tourism_ts) =="Date"] <-"date"#read .csv with stringency indexstringency_df <-read.csv(here("data/stringency_index.csv"))# Filter data by locationstringency_philippines <-filter(stringency_df, location =="Philippines")stringency_switzerland <-filter(stringency_df, location =="Switzerland")# Convert dates and set them to the first day of the monthstringency_philippines$date <-as.Date(format(dmy(stringency_philippines$date), "%Y-%m-01"))stringency_switzerland$date <-as.Date(format(dmy(stringency_switzerland$date), "%Y-%m-01"))# Aggregate to monthly average, ensuring date format is maintainedstringency_philippines <- stringency_philippines %>%group_by(date) %>%summarize(avg_stringency_index =mean(stringency_index, na.rm =TRUE))stringency_switzerland <- stringency_switzerland %>%group_by(date) %>%summarize(avg_stringency_index =mean(stringency_index, na.rm =TRUE))# Merge with Philippines data firstmerged_data_philippines <-merge(tourism_ts, stringency_philippines, by ="date", all.x =TRUE)# Then merge with Switzerland datamerged_data <-merge(merged_data_philippines, stringency_switzerland, by ="date", all.x =TRUE, suffixes =c("_PH", "_SW"))# Replace NA values in avg_stringency_index with 0 if necessarymerged_data$avg_stringency_index_PH[is.na(merged_data$avg_stringency_index_PH)] <-0merged_data$avg_stringency_index_SW[is.na(merged_data$avg_stringency_index_SW)] <-0# Create a ggplot of the stringency indexstringency_zh <-ggplot(merged_data, aes(x = date, y = avg_stringency_index_PH, color ="Philippines")) +geom_line() +geom_line(aes(y = avg_stringency_index_SW, color ="Switzerland")) +labs(title ="Stringency Index in the Philippines and Switzerland",x ="Date",y ="Stringency Index") +scale_color_manual(values =c("#3C5B6F", "darkred"),labels =c("Philippines", "Switzerland"))
We observe that Philippines had a higher stringency index than Switzerland all over the period.
You can observe in the appendix the resulted data from the merge of the stringency index and the tourism data
Click to show code
#show merge data using reactablestringency_table <-reactable(merged_data, searchable =TRUE,defaultPageSize =5,sortable =TRUE,filterable =TRUE,pagination =TRUE,bordered =TRUE,highlight =TRUE,striped =TRUE,compact =TRUE,showPageSizeOptions =TRUE,showSortable =TRUE )
5.2.2.2 ARIMAX
In our multiple regression analysis, with the number of tourists as the value of interest and the average stringency index as the explanatory variable for Switzerland and Philippines. In the table of characteristics, see Appendix, the stringency indices have opposite impacts: an increase in stringency in the Philippines is associated with an increase in tourism, while an increase in stringency in Switzerland is associated with a decrease in tourism, for a confidence interval of 95%. These results are counter-intuitive. Furthermore, only 9.3% of the variance in the model is explained, which means that the stringency index is not the only factor that influences the number of tourists from the Philippines to Zurich. We will apply an ARIMAX model to take into account the stringency index as an exogenous variable. We fix seasonality as TRUE and stepwise and approximation as FALSE to have a more thorough search for the best model, as we saw before this provided better results. see the appendix for a vizualisation and metrics of the ARIMAX model
Click to show code
# Transform to a time series object with frequency 12 (monthly data)tourism_ts <-ts(merged_data$value, frequency =12, start =c(2005, 1)) # Adjust the start time as per your data# Ensure exogenous variables have the same length and frequencyexog_data <-cbind(merged_data$avg_stringency_index_PH, merged_data$avg_stringency_index_SW)# Check if lengths of tourism_ts and exog_data matchif (length(tourism_ts) ==nrow(exog_data)) {# Fit an ARIMAX model model <-auto.arima(tourism_ts, xreg = exog_data, seasonal =TRUE, stepwise =FALSE, approximation =FALSE)# Summary of the modelsummary(model)# Forecast the next 15 months, setting future exogenous variables to 0 future_exog <-matrix(0, nrow =15, ncol =2) forecast_values <-forecast(model, xreg = future_exog, h =15)# Convert the forecast to a data frame forecast_arimax <-as.data.frame(forecast_values)# Rename the columns for claritycolnames(forecast_arimax) <-c("Point Forecast", "Lo 80", "Hi 80","Lo 95", "Hi 95") forecast_arimax$Date <-seq(as.Date("2023-10-01"), by ="month",length.out =15)# Plot the forecast along with the actual data using autoplot from the forecast package plot_forecast_arimax <-autoplot(forecast_values, series ="Forecast") +autolayer(tourism_ts, series ="Actual Data") +labs(title ="ARIMAX Forecast of Tourists from the Philippines to Zurich",x ="Date",y ="Number of Tourists") +guides(colour =guide_legend(title ="Data Type"))# Calculate evaluation metrics on the training data residuals <-residuals(model) mae <-mean(abs(residuals)) mape <-mean(abs(residuals / tourism_ts)) *100 rmse <-sqrt(mean(residuals^2)) aic <-AIC(model) bic <-BIC(model)# Print evaluation metricscat("Evaluation Metrics:\n")cat("MAE:", mae, "\n")cat("MAPE:", mape, "\n")cat("RMSE:", rmse, "\n")cat("AIC:", aic, "\n")cat("BIC:", bic, "\n")} else {stop("Length of tourism_ts and exog_data do not match. Please check the data.")}#> Evaluation Metrics:#> MAE: 65 #> MAPE: 74.9 #> RMSE: 105 #> AIC: 2744 #> BIC: 2774
Overall, ARIMA_auto_stepwise shows the best balance in terms of RMSE and AIC, while ARIMAX excels in MAE and MAPE. Depending on which metric is prioritized, either ARIMAX or ARIMA_auto_stepwise would be the preferred model.
see the appendix for other ideas that we had and that we could have implemented
6 Model Selection
6.1 Vaud
In the model selection process, we compare the performance of different forecasting models using various metrics to determine which model provides the most accurate and reliable forecasts for the number of tourists in Vaud.
In comparing these models, we prioritize metrics such as AIC, RMSE, MAE and MAPE as lower values indicate better forecasting accuracy. After evaluating the performance of the models, it is evident that ARIMA (5,1,0)(0,1,1) consistently achieves the lowest AIC, RMSE and MAE. Additionally, considering metrics MASE, which provide insights into the relative accuracy of the models compared to a baseline, ARIMA (5,1,0)(0,1,1) emerges as the most favorable choice. Therefore, based on these criteria, ARIMA (5,1,0)(0,1,1) is the preferred model for forecasting the number of tourists in Vaud. A MAPE value of 9.04 means that, on average, the forecasts produced by the chosen model, ARIMA (5,1,0)(0,1,1), deviate from the actual values by approximately 9.04%.This suggests that the model is performing reasonably well in capturing the underlying patterns and trends in the data, but there is still room for improvement.
Click to show code
# Filter forecasts for ARIMA_dif modelforecast_arima_dif <- forecast_arima_vaud %>%filter(.model =="ARIMA_dif")# Extract month and year from the indexforecast_arima_dif$Month <-substr(forecast_arima_dif$index, 6, 8)forecast_arima_dif$Year <-substr(forecast_arima_dif$index, 1, 4)# Create a Date column in "Year Month" formatforecast_arima_dif$Date <-paste(forecast_arima_dif$Year, forecast_arima_dif$Month)# Remove Month and Year columns if neededforecast_arima_dif <-subset(forecast_arima_dif, select =-c(Month, Year))# Extract mean and variance from the "value" columnmean_variance <-str_extract_all(forecast_arima_dif$value, "[0-9.]+")#> Warning in stri_extract_all_regex(string, pattern, simplify =#> simplify, : argument is not an atomic vector; coercing# Extract mean and variance valuesmean_values <-as.numeric(sapply(mean_variance, function(x) x[1]))variance_values <-as.numeric(sapply(mean_variance, function(x) x[2]))# Calculate standard deviationsstandard_deviations <-sqrt(variance_values)# Calculate lower and upper confidence intervalsforecast_arima_dif$Lower_CI <- mean_values -1.96* standard_deviationsforecast_arima_dif$Upper_CI <- mean_values +1.96* standard_deviations# Create reactable table with selected columnsforecast_vaud_finaltable <-reactable( forecast_arima_dif[, c("Date", ".mean", "Lower_CI", "Upper_CI")],columns =list(Date =colDef(name ="Date"),Forecast_value =colDef(name =".mean"),Lower_CI =colDef(name ="Lower CI"),Upper_CI =colDef(name ="Upper CI") ),defaultColDef =colDef(format =colFormat(digits =2)))
6.2 Zurich and Philippines
The performance measures of each of the models we have predicted help us to select the model that best fits our data, i.e. the model that best fits between goodness of fit and complexity.
The main objective of this project is to accurately predict the number of visitors, which is why we will first analyse the best MAE and RMSE values. The model with the lowest values is ETS_M_seaso_Ad. It is therefore recommended because it shows more accurate predictions, which are crucial for ensuring accurate and usable information for decision-making in the Zurich tourism sector. However, ARIMAX has a better AIC and MAPE. Using both forecasts hand in hand could be a better approach to have a more robust forecast than selecting only one model. This ‘forecasting ensemble’ would provide increased robustness, reduced bias and improved overall performance.
Check the Appendix for the forecasted values
Click to show code
##### ARIMAX ###### Rename the columns for claritycolnames(forecast_arimax) <-c("ARIMAX_Forecast", "ARIMAX_Lo_80", "ARIMAX_Hi_80", "ARIMAX_Lo_95", "ARIMAX_Hi_95", "Date")# Convert the 'Date' column to Date type for mergingforecast_arimax$Date <-as.Date(forecast_arimax$Date)forecast_ETS_M_seaso_Ad$Date <-as.Date(forecast_ETS_M_seaso_Ad$Date)forecast_ETS_M_seaso_Ad#> # A tibble: 15 x 4#> Date ETS_M_seaso_Ad_Forecast ETS_Lo_95 ETS_Hi_95#> <date> <dbl> <dbl> <dbl>#> 1 2023-10-01 1126. 974. 1272.#> 2 2023-11-01 882. 723. 1040.#> 3 2023-12-01 1574. 1358. 1788.#> 4 2024-01-01 460. 300. 612.#> 5 2024-02-01 418. 250. 590.#> 6 2024-03-01 527. 322. 731.#> 7 2024-04-01 1098. 735. 1448.#> 8 2024-05-01 1094. 721. 1454.#> 9 2024-06-01 1002. 646. 1346.#> 10 2024-07-01 864. 532. 1181.#> 11 2024-08-01 730. 427. 1022.#> 12 2024-09-01 921. 539. 1291.#> 13 2024-10-01 1179. 698. 1650.#> 14 2024-11-01 918. 525. 1301.#> 15 2024-12-01 1633. 941. 2306.# Merge the two data frames based on the Datecombined_forecast <-merge(forecast_arimax, forecast_ETS_M_seaso_Ad, by ="Date")# Calculate the average forecast and average confidence intervalscombined_forecast <- combined_forecast %>%mutate(Avg_Forecast = (ARIMAX_Forecast + ETS_M_seaso_Ad_Forecast) /2,Avg_Lo_95 = (ARIMAX_Lo_95 + ETS_Lo_95) /2,Avg_Hi_95 = (ARIMAX_Hi_95 + ETS_Hi_95) /2 )# Select the relevant columnsfinal_forecast <- combined_forecast %>%select(Date, Avg_Forecast, Avg_Lo_95, Avg_Hi_95)# Display the combined table using reactablefinal_forecast_table_zh <-reactable( final_forecast,columns =list(Date =colDef(name ="Date"),Avg_Forecast =colDef(name ="Average Forecast",format =colFormat(digits =2) ),Avg_Lo_95 =colDef(name ="Average 95% Lower CI",format =colFormat(digits =2) ),Avg_Hi_95 =colDef(name ="Average 95% Upper CI",format =colFormat(digits =2) ) ),defaultPageSize =5,highlight =TRUE,striped =TRUE,bordered =TRUE,filterable =TRUE,searchable =TRUE,sortable =TRUE,resizable =TRUE)#save final forecast in a .csv#write.csv(final_forecast, "../data/final_forecast.csv", row.names=FALSE)
We can see that there is a significant difference between the two models. The differences (delta) are sometimes very high, for example -351, indicating that the two models capture different seasonal components or trends. Combining the two and taking their average would provide robust and balanced predictions.
7 Conclusion
The project analyzed tourism data to predict the number of overnight stays of tourists in Vaud and from the Philippines to Zurich in the period from October 2023 to December 2024. The study focused on the cantons of Vaud and Zurich, adapting the analysis to specific regional and international tourism dynamics.
A preliminary analysis of the data revealed key patterns and trends, such as the significant impact of seasonal fluctuations and the COVID-19 pandemic on tourist behavior. In Vaud, the presence of a recurring seasonal pattern and a changing trend has highlighted the need for models that can adapt to such fluctuations. The analysis conducted in Zurich revealed specific periods of high tourist activity that correspond to the holiday calendar in the Philippines, which provided valuable information for the development of targeted tourism strategies.
The use of forecasting models such as ETS and ARIMA has demonstrated their effectiveness in identifying these complex patterns, providing reliable forecasts of future tourist flows. The models were selected based on their ability to account for seasonality, trends and significant impacts of global events such as the pandemic, which were important for accurate forecasting.
In conclusion, it should be noted that this analysis has revealed a special impact of global events on tourism. This data is important for strategic planning and decision-making in the tourism cantons of Vaud and Zurich. In the future, we recommend continued monitoring and adaptive forecasting to take into account unforeseen global impacts, ensuring sustainable growth of the tourism industry in these regions.
8 Appendix
8.1 General Overview
As the dataset provided is in German, we have translated the data in English to make it more intuitive and understandable for everyone. Then, we created a new ‘Date’ column, year-month-day, which corresponds to the correct format to be able to make predictions.
Here are the first 1000 instances of the raw data :
Click to show code
#show data using reactable only showing the first 100 rowsreactable(head(tourism_data_no_total, 1000), searchable =TRUE,defaultPageSize =5,sortable =TRUE,filterable =TRUE,pagination =TRUE,bordered =TRUE,highlight =TRUE,striped =TRUE,compact =TRUE,showPageSizeOptions =TRUE,showSortable =TRUE )
8.2 Vaud - Tourism Data
Click to show code
tourism_vaud#> # A tibble: 17,550 x 6#> Herkunftsland Kanton Monat Jahr value Date #> <chr> <chr> <fct> <chr> <dbl> <date> #> 1 Herkunftsland - Total Vaud January 2005 57942 2005-01-01#> 2 Schweiz Vaud January 2005 27853 2005-01-01#> 3 Baltische Staaten Vaud January 2005 103 2005-01-01#> 4 Deutschland Vaud January 2005 3052 2005-01-01#> 5 Frankreich Vaud January 2005 7667 2005-01-01#> 6 Italien Vaud January 2005 2154 2005-01-01#> 7 Österreich Vaud January 2005 192 2005-01-01#> 8 Vereinigtes Königreich Vaud January 2005 3695 2005-01-01#> 9 Irland Vaud January 2005 118 2005-01-01#> 10 Niederlande Vaud January 2005 1365 2005-01-01#> # i 17,540 more rows
8.3 VD - Plot visitors from countries
Click to show code
# Create the ggplot objectplot_vaud <- tourism_vaud %>%filter(Herkunftsland !='Herkunftsland - Total') %>%ggplot(aes(x = Date, y = value, group = Herkunftsland, color = Herkunftsland,text =paste("Country:", Herkunftsland, "Trips:", value))) +# Added text for tooltipgeom_line(show.legend =FALSE) +scale_color_viridis_d() +# Use viridis color palettelabs(title ="Number of visitors from Each Country to Vaud",x ="Date",y ="Number of Trips") +theme_minimal()# Convert to an interactive plotly object with specified width and heightinteractive_plot <-ggplotly(plot_vaud, tooltip ="text", width =600, height =400)# Adjust plotly settings interactive_plot %>%layout(margin =list(l =60, r =60, b =60, t =80),showlegend =FALSE# Remove legend )
8.4 Vd - STL
Click to show code
stl_vd
8.5 VD - ETS forecast
Click to show code
forecast_ets_vaud <-forecast(ets_vaud, h =15) %>%plot(main ="Forecast of visitors in Vaud", xlab ="Date", ylab ="Number of visitors")
8.6 VD - ARIMA forecast
Click to show code
plot_arima_vaud
8.7 VD - TSLM forecast
Click to show code
forecast_vaud_stringency %>%autoplot(merged_vaud_tsibble) +labs(title ="Tourist Forecast for Canton Vaud with Stringency Index = 0", y ="Number of Tourists", x ="Date") +guides(colour =guide_legend(title ="Forecast"))
8.8 ZH - Tourism Data
We filtered the “Kanton” column to keep only the canton of Zurich.
Here are the first 1000 instances of the raw data :
Click to show code
#show the data in a table using reactablereactable(head(tourism_data_zurich, 1000), searchable =TRUE,defaultPageSize =5,sortable =TRUE,filterable =TRUE,pagination =TRUE,bordered =TRUE,highlight =TRUE,striped =TRUE,compact =TRUE,showPageSizeOptions =TRUE,showSortable =TRUE )
Click to show code
table_zh_ph
There are 1869 missing values for the two sub-datasets. These missing values come from the ‘value’ column, creating gaps in the time series. We’ll see later how we’re going to process them to do modelling.
8.9 ZH -Zurich and All visiting countries
The graph shows the monthly number of overnight stays in Zurich from tourists of different countries.
Click to show code
# Display the interactive plotinteractive_plot_zh_all
8.10 ZH - STL
The additive time series decomposition of the monthly overnight stays for tourists coming from the Philippines to the canton of Zurich shows:
Click to show code
stl_zh
The multiplicative time series decomposition of the monthly overnight stays for tourists coming from the Philippines to the canton of Zurich shows:
Click to show code
stl_zh_m
8.11 ZH - Seasonality
Seasonal sub-series plot permit to better visualize the monthly fluctuations of each year, from 2005 to 2023.
Click to show code
# Plot the seasonality in one chartggseasonplot(tourism_ts, year.labels =TRUE, year.labels.left =TRUE) +scale_color_viridis_d() +theme_minimal()
Click to show code
seaso_plot_zh_2
We clearly observe here that the seasonality is not constant over time.
The months of May to July and October seem to have visitor peaks, which may indicate a high tourist season during this period. As we saw before, this is due to their calendar. The years 2022 and 2023 show a significant increase in visitor numbers compared with previous years. In particular, the months from May to October 2022 and 2023 show much higher values. This growth may be due to several factors, such as a post-pandemic recovery in travel or specific initiatives that have attracted more tourists.
8.12 ZH - Naive Forecast
The graph shows the historical trend in the number of tourists from the Philippines to Zurich and the forecasts for the next 15 months using the naive model. We took the graph representing the total number of tourists coming from the Philippines to Zurich.
Click to show code
plot_naive
This naive model predicts that future values will be equal to the last observed value in the time series. It does not take into account the past events like the pandemic and assumes here that the levels observed after this extreme fall will remain unchanged. The model does not take into account trends or seasonality neither, which are very present in our case. It’s a simplified approach.
The blue areas represent the 80% (darker) and 95% (lighter) confidence intervals of the forecasts. The wider the interval, the greater the uncertainty about the long-term forecasts which is the case here.
We can see here the metrics of the naive model :
Click to show code
metrics_naive
Naive Model Metrics
.model
.type
ME
RMSE
MAE
MPE
MAPE
MASE
RMSSE
ACF1
NAIVE
Training
3.44
135
79.2
-18.8
48.3
0.698
0.629
-0.3
8.13 ZH - ETS AAA model
Click to show code
plot_ets_AAA
Clearly see here that the confidence interval is too big, almost like a naive forecast, therefore as suspected we will move to a multiplicative seasonality model.
We see here the metrics of the ETS AAA model :
Click to show code
metrics_ets_AAA
ETS AAA Metrics
.model
.type
ME
RMSE
MAE
MPE
MAPE
MASE
RMSSE
ACF1
ETS_M_seaso
Training
5.5
108
73
-61.7
124
0.643
0.503
0.034
8.14 ZH - ETS M model
Click to show code
plot_ets_AAM_AAdM
Click to show code
metrics_ets
Model Accuracy Metrics with AIC Values
.model
.type
ME
RMSE
MAE
MPE
MAPE
MASE
RMSSE
ACF1
AIC
ETS_M_seaso
Training
2.93
85.7
55.3
-81.5
102
0.487
0.399
0.040
3256
ETS_M_seaso_Ad
Training
2.06
74.7
48.9
-78.7
105
0.431
0.348
0.181
3195
8.15 ZH - ARIMA Models
We apply three ARIMA models and determine which one gives us the best accury in term of prediction :
An ARIMA_auto, which will automatically determine the optimal values of the ARIMA parameters (p, d, q) using AIC criteria.
An ARIMA_auto_seasonal, which works like the ARIMA_auto but takes into account the monthly seasonality of our data.
ARIMA_auto_stepwise, which performs an exhaustive search over all possible combinations of parameters by disabling the stepwise and approximation options. It increases the accuracy of the model and do not take seasonal components into account.
Click to show code
plot_arima_zh
From the graph we can see that the ARIMA_auto_seasonal model gives more negative predictions over the next 15 months than the other two models. The ARIMA_auto_stepwise model predicts the highest values while the ARIMA_auto model lies between the two prediction curves.
Here are the metrics of the ARIMA model :
Click to show code
metrics_arima_table
Model Accuracy Metrics with AIC Values
.model
.type
ME
RMSE
MAE
MPE
MAPE
MASE
RMSSE
ACF1
AIC
ARIMA_auto
Training
6.74
113
67.7
-68.7
123
0.597
0.526
-0.024
2771
ARIMA_auto_seasonal
Training
4.47
105
70.4
-72.8
146
0.620
0.489
0.027
2751
ARIMA_auto_stepwise
Training
5.80
104
67.6
-72.2
143
0.596
0.485
-0.005
2738
According to the table of metrics, the ARIMA_auto_stepwise model appears to be the best choice with the lowest RMSE, MAE and AIC values. This indicates greater accuracy and better model fit despite the computation time involved with this type of model. If computing resources permit, this time cost is recouped by greater accuracy and therefore better fit. ARIMA_auto_seasonal is very close to the stepwise model in terms of metrics but shows good model accuracy.
8.16 ZH - Stringency Index
The OxCGRT provides data on government responses to COVID-19, including a Stringency Index that measures the severity of lockdown measures. The Stringency Index is calculated based on nine key metrics: School closures, Workplace closures, Cancellation of public events, Restrictions on public gatherings, Closures of public transport, Stay-at-home requirements, International travel controls, etc… Each metric is scored from 0 to 100, with higher scores indicating stricter responses. The overall Stringency Index is the average score of these metrics, reflecting the strictest sub-regional policies if they vary.
Simple yet effective, if the relationships between the covariates and the dependent variable (tourist numbers) are linear.
Here is a summary of the effect of the variables on the number of tourists from the Philippines to Zurich :
Click to show code
# Fit a multiple regression modelmodel <-lm(value ~ avg_stringency_index_PH + avg_stringency_index_SW, data = merged_data)# Summary of the modelsummary(model)#> #> Call:#> lm(formula = value ~ avg_stringency_index_PH + avg_stringency_index_SW, #> data = merged_data)#> #> Residuals:#> Min 1Q Median 3Q Max #> -318.9 -157.3 -69.3 93.7 873.7 #> #> Coefficients:#> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 246.32 15.85 15.54 <2e-16 ***#> avg_stringency_index_PH 5.87 2.55 2.30 0.0221 * #> avg_stringency_index_SW -12.19 3.73 -3.26 0.0013 ** #> ---#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1#> #> Residual standard error: 220 on 222 degrees of freedom#> Multiple R-squared: 0.0931, Adjusted R-squared: 0.0849 #> F-statistic: 11.4 on 2 and 222 DF, p-value: 1.95e-05# Forecast the next 24 monthsforecast_values <-predict(model, newdata = merged_data)#us gtsummary to show the summary of the modelmodel %>% gtsummary::tbl_regression()
Characteristic
Beta
95% CI1
p-value
avg_stringency_index_PH
5.9
0.85, 11
0.022
avg_stringency_index_SW
-12
-20, -4.8
0.001
1 CI = Confidence Interval
We observe that the p-value are lower for Switzerland than for Philippines. This means that government stringency index in Switzerland has a more significant impact on the number of tourists than in the Philippines. Indeed, the beta of -12 for Switzerland and 5.9 for Philippines shows that the number of tourists in Switzerland decreases by 12 for each unit of stringency index, which seems logical. However the number of tourists in the Philippines increases by 5.9 for each unit of stringency index, which is counter-intuitive. This could be due to the fact that the stringency index is not the only factor that influences the number of tourists from the Philippines to Zurich.
We can observe here the metrics:
Click to show code
#create a table with the metrics of the model and show it as an htmlmodel_metrics <- model %>% broom::glance()# show html table with metricsmodel_metrics %>%kable("html", caption ="Model Metrics") %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed", "responsive"))
Model Metrics
r.squared
adj.r.squared
sigma
statistic
p.value
df
logLik
AIC
BIC
deviance
df.residual
nobs
0.093
0.085
220
11.4
0
2
-1531
3070
3084
10734309
222
225
The way higher AIC shows that this model is not the best.
8.18 ZH - Covid
The COVID-19 pandemic is considered a Black Swan event due to its unpredictable nature and global impact. The question is how can you incorporate this event in model predictions ? Adjust forecasts: include covariates that capture the effects of the pandemic. This allows us to adjust our forecasts to take account of the impact of COVID-19. For example, you could include a covariate that captures the effect of closures or travel restrictions on tourism data. Scenario analysis: run scenario simulations to analyse the potential impact of different COVID-19 scenarios on our forecasts. By analysing the different possible outcomes, you can better prepare for the uncertainties associated with the pandemic. Sensitivity analysis: assess the sensitivity of your forecasts to changes in assumptions or key parameters. By performing a sensitivity analysis, you can identify the factors that have the greatest impact on your forecasts and assess the robustness of your models.
8.19 ZH - ARIMAX Model
Click to show code
plot_forecast_arimax
Here are the metrics of the ARIMAX model with the metrics of the ARIMA model for comparison :
Click to show code
metrics_arimax
Model Metrics
Model
MAE
MAPE
RMSE
AIC
BIC
ARIMAX
65
74.9
105
2744
2774
Click to show code
metrics_arima_table
Model Accuracy Metrics with AIC Values
.model
.type
ME
RMSE
MAE
MPE
MAPE
MASE
RMSSE
ACF1
AIC
ARIMA_auto
Training
6.74
113
67.7
-68.7
123
0.597
0.526
-0.024
2771
ARIMA_auto_seasonal
Training
4.47
105
70.4
-72.8
146
0.620
0.489
0.027
2751
ARIMA_auto_stepwise
Training
5.80
104
67.6
-72.2
143
0.596
0.485
-0.005
2738
8.20 ZH - Other Ideas
Other ideas for taking into account the impact of Covid-19 in our prediction models would be: - Create a dummy variable that takes the value 1 during periods of confinement and severe restrictions, and 0 otherwise. This models the direct impact of the pandemic on tourist numbers. If there are missing values, the ARIMA method can be used to interpolate the missing values on the basis of the patterns observed in the available data with the use fill_gaps(). - Implement additional exogenous variables in our ARIMAX model, such as the number of Covid-19 deaths and/or the vaccination rate. - Jump regression models and GARCH models are powerful tools for analysing time series when there are sudden changes or periods of high volatility such as during the Covid-19 pandemic. - Apply machine learning techniques to capture non-linear and complex patterns such as random forest or gradient boosting or recurrent neural networks RNN.